
Move Job objects to postgres #368

Merged
merged 15 commits into CERT-Polska:master from feature/postgres-job on Feb 5, 2024

Conversation

@msm-code (Contributor) commented Feb 4, 2024

Your checklist for this pull request

  • I've read the contributing guideline.
  • I've tested my changes by building and running mquery, and testing changed functionality (if applicable)
  • I've added automated tests for my change (if applicable, optional)
  • [n/a] I've updated documentation to reflect my change (if applicable)

What is the current behaviour?
Job objects are kept in redis

What is the new behaviour?
Job objects are kept in postgres.

I'm consciously trying to keep everything very self-contained, and I don't want to change a single line outside of the Database abstraction. Having this layer is very useful when swapping out the whole storage layer.

Related to #74

@msm-code msm-code changed the title Feature/postgres job Move Job objects to postgres Feb 4, 2024
@msm-cert msm-cert requested a review from nazywam February 5, 2024 12:03
docs/redis.md Outdated (review thread resolved)
src/app.py Outdated
@@ -492,11 +492,11 @@ def matches(

 @app.get(
     "/api/job/{job_id}",
-    response_model=JobSchema,
+    response_model=Job,
@nazywam (Member) commented Feb 5, 2024:

We're fine with returning the whole row from db, right? I.e. there's no sensitive stuff there that users shouldn't have access to

Member:

(Because if that was the case, we could define a schema for reading the job object)

@msm-cert (Member) commented Feb 5, 2024:

Hmm, for the postgres migration I had to define internal_id for the Job object. It's not really sensitive, but by design it's not needed for anything. I'll think about a Read object.
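For illustration, a Read object along these lines would keep internal_id out of API responses. This is only a sketch: JobView and job_to_view are hypothetical names, and apart from id, status and finished (which appear elsewhere in this PR) the field list is made up.

from typing import Optional

from sqlmodel import SQLModel


class JobView(SQLModel):
    # Hypothetical read schema: user-facing fields only, no internal_id.
    id: str
    status: str
    finished: Optional[int] = None


def job_to_view(job: "Job") -> JobView:
    # Job is the table model added in this PR; project it explicitly so
    # internal columns never leak into API responses.
    return JobView(id=job.id, status=job.status, finished=job.finished)

The /api/job/{job_id} endpoint could then declare response_model=JobView instead of exposing the full Job row.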

@@ -67,55 +68,36 @@ def __schedule(self, agent: str, task: Any, *args: Any) -> None:

     def get_job_ids(self) -> List[JobId]:
         """Gets IDs of all jobs in the database"""
-        return [key[4:] for key in self.redis.keys("job:*")]
+        with Session(self.engine) as session:
Member:

We're probably going to be writing this a lot; should we maybe create a helper like with self.get_session()?

Member:

I actually think it'll be useful to even have something like

self.execute(update(Job).where(...))

to handle commits etc. automatically (this is a very common pattern). I'll do a separate refactor PR in a while.
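As a rough sketch of what such helpers could look like (get_session and execute are hypothetical names; the existing Database class already owns self.engine):

from contextlib import contextmanager
from typing import Any, Iterator

from sqlmodel import Session


class Database:
    # ... existing __init__ creates self.engine ...

    @contextmanager
    def get_session(self) -> Iterator[Session]:
        # Open a session, hand it to the caller, commit if no exception was raised.
        with Session(self.engine) as session:
            yield session
            session.commit()

    def execute(self, statement: Any) -> None:
        # Run a single write statement in its own committed session.
        with self.get_session() as session:
            session.execute(statement)

A call site would then shrink to self.execute(update(Job).where(Job.id == job).values(status="removed")).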

@@ -67,55 +68,36 @@ def __schedule(self, agent: str, task: Any, *args: Any) -> None:

     def get_job_ids(self) -> List[JobId]:
         """Gets IDs of all jobs in the database"""
-        return [key[4:] for key in self.redis.keys("job:*")]
+        with Session(self.engine) as session:
+            jobs = session.exec(select(Job)).all()
Member:

We probably could optimize this by selecting just the required row. But if there's not that many jobs and no automatic table joins it's probably not worth the effort at this moment?

Member:

Do you mean just the required column? Yeah, I've thought about this, but when I wrote it I only wanted to get it working and come back to it later.

It's not worth refactoring IMO, because this function is the wrong abstraction. It's used in some places, for example like this:

for job_id in db.get_job_ids():
    db.get_job(job_id)

I didn't change it right away to avoid changes outside of db.py, but the database abstraction needs to be changed significantly later.
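If the column-only query ever becomes worth it, the change would roughly be selecting the id column instead of whole rows, with the same SQLModel session pattern used elsewhere in this PR:

with Session(self.engine) as session:
    # Fetch only the id column instead of full Job rows.
    return list(session.exec(select(Job.id)).all())

Everything else in get_job_ids would stay the same.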

{"status": "cancelled", "finished": int(time())},
)
def cancel_job(self, job: JobId, error=None) -> None:
"""Sets the job status to cancelled, with optional error message"""
Member:

What happens to the error message though?

Member:

Nice catch (refactoring gone wrong).
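For reference, persisting the error would roughly mean adding it to the update values, using the same Session/update pattern as the rest of this PR; this assumes the Job model has (or gains) an error column, which is not shown here:

def cancel_job(self, job: JobId, error=None) -> None:
    """Sets the job status to cancelled, with optional error message"""
    with Session(self.engine) as session:
        session.execute(
            update(Job)
            .where(Job.id == job)
            .values(status="cancelled", finished=int(time()), error=error)
        )
        session.commit()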


-    def get_job(self, job: JobId) -> JobSchema:
+    def get_job(self, job: JobId) -> Job:
Member:

Seems like the correct typing should be Optional[Job] most likely?

And then we'd want to use one_or_none().

Unless sqlalchemy.orm.exc.NoResultFound is handled somewhere down the line

Member:

It's not handled, but this should raise an exception if a job with the specified ID doesn't exist. Does that make sense? 🤔 It's mostly used in contexts where the UID is known to be good, or where it's user input and a bad ID should just fail.
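The two behaviours being discussed look roughly like this (a sketch; the actual body of get_job isn't shown in this thread):

# Current intent: .one() raises sqlalchemy.orm.exc.NoResultFound for unknown IDs.
job = session.exec(select(Job).where(Job.id == job_id)).one()

# The Optional[Job] alternative suggested above:
maybe_job = session.exec(select(Job).where(Job.id == job_id)).one_or_none()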

-        self.redis.hmset(f"job:{job}", {"status": "removed"})
+        with Session(self.engine) as session:
+            session.execute(
+                update(Job).where(Job.id == job).values(status="removed")
Member:

Some kind of enum/consts for the query statuses at some point probably?

Member:

Yeah, I have it in the TODO, saving it for a refactor 🙏 (I'll need to check how to do this in sqlalchemy, but I'll check other projects I guess).

Member:

Created an issue out of it: #370
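A minimal version of the constants idea could be a str-valued enum, so the column stays a plain text column. The member list below is illustrative; only "cancelled" and "removed" actually appear in this diff:

import enum


class JobStatus(str, enum.Enum):
    # Hypothetical constants for the status column.
    cancelled = "cancelled"
    removed = "removed"

The existing call would then become update(Job).where(Job.id == job).values(status=JobStatus.removed.value).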

"taints": json.dumps(taints),
}
with Session(self.engine) as session:
obj = Job(
Member:

We could theoretically provide the defaults in the model declaration, but since this is the only place the object is initialized, it's probably not worth it at the moment?

Member:

Good question 🤔 it depends on which one is clearer. Having defaults in the model for things like files_in_progress makes sense. Once we have a proper database, I'm considering splitting this into several smaller objects (to avoid one huge Job object that counts everything).
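A sketch of the defaults-in-the-model variant; apart from status, finished and files_in_progress, the fields and default values here are illustrative, and the real model in this PR also has internal_id and other columns:

from typing import Optional

from sqlmodel import Field, SQLModel


class Job(SQLModel, table=True):
    id: str = Field(primary_key=True)  # simplified; the PR's model also has internal_id
    status: str = "new"                # illustrative default
    files_in_progress: int = 0         # the counter mentioned above
    finished: Optional[int] = None

The create path would then only need to pass the values that differ from the defaults.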

@msm-code msm-code force-pushed the feature/postgres-job branch from f5b724b to eae7f24 Compare February 5, 2024 16:30
@msm-cert msm-cert merged commit 9d06fe5 into CERT-Polska:master Feb 5, 2024
10 checks passed